METHOD OF EXECUTING AN INTERPRETER PROGRAM

BACKGROUND OF THE INVENTION

The invention relates to a method of executing a threaded interpreter for interpreting a program comprising a series of program instructions, the method comprising for the execution of each program instruction: a plurality of preparatory steps making the program instruction available in the threaded interpreter, and an execution step emulating the program instruction.

The invention also relates to a system for executing a threaded interpreter interpreting a program comprising a series of program instructions, the system comprising: a memory for storing the series of program instructions, and the threaded interpreter comprising a preparatory unit for executing a plurality of preparatory steps making a particular program instruction available in the threaded interpreter, and an execution unit for emulating the particular program instruction.

The invention also relates to a data carrier comprising a threaded interpreter for interpreting a program comprising a series of program instructions, the threaded interpreter comprising: a preparatory unit for executing a plurality of preparatory steps making a particular program instruction available in the threaded interpreter, and an execution unit for emulating the particular program instruction.

The invention also relates to a system for generating an executable interpreter for interpreting a program comprising a series of program instructions, the system comprising a compiler for translating the interpreter from a source code into machine instructions, the interpreter in the source code comprising: a preparatory unit for executing at least one preparatory step making one of the program instructions available in the interpreter, and an execution unit with an emulation code for emulating one of the program instructions.

The invention also relates to a data carrier comprising a compiler for generating an executable interpreter for interpreting a program comprising a series of program instructions, the compiler being arranged to translate the interpreter from a source code into executable machine instructions, the interpreter in the source code comprising: a preparatory unit for executing at least one preparatory step making one of the program instructions available in the interpreter, and an execution unit with emulation code for emulating one of the program instructions.

It is known to execute a program by means of an interpreter. Interpretation is a program execution technique where, as opposed to the execution techniques using a compiler, the program is not translated in advance into a form suitable for direct execution by a specific processor. The program to be executed is described in a standard form which is not dedicated to a specific processor. An interpreter, being a program specific for the processor at hand, reads a program instruction of the program to be executed and analyses this program instruction. Subsequently, the interpreter determines what actions must be taken and has these actions executed by the processor. Reading a program instruction and execution of the corresponding machine instructions are carried out in an alternating fashion, without storing the translated program instructions in an intermediate format. A program instruction has an operation code that indicates the type of operation to be carried out, e.g. an add operation. Furthermore, a program instruction may have one or more immediate arguments following the operation; they are operands for the operation. Suitable examples of a standard form in which the program to be interpreted can be described are the Java byte code and the P-code into which a Pascal program is translated.

Program execution on the basis of interpretation of the program to be executed is slower than on the basis of a compiled program. In the latter case, the program is translated in advance and stored in the form of machine instructions directly executable by the processor. In case of interpretation, at least the final phase of the translation is done at runtime by the interpreter running on the processor and using resources and time of the processor. This makes the execution of a program on the basis of an interpreter slower. The article 'Interpretation Techniques', Paul Klint, Software: Practice and Experience, Vol. 11, pages 963-973, September 1981, describes a so-called threaded interpreter, which is a relatively fast interpreter that does not require techniques which are costly in respect of memory. A threaded interpreter contains a block of machine instructions for each of the program instructions to be interpreted and executed. Such a block contains the following elements:

    emulation code for the program instruction, i.e. one or more machine instructions to be executed by the processor for realizing the purpose of the program instruction;

    a fetch instruction for fetching the next program instruction to be executed;

    a decode instruction for decoding that program instruction after it has been fetched;

    a jump to the block of that program instruction.

The threaded interpreter can be seen as several of these blocks in parallel. The threaded interpreter has a block for each kind of program instruction that has to be interpreted, e.g. 256 blocks when 256 different program instructions are supported. After the execution of a certain block, a jump is made to the block implementing the next program instruction to be executed. Then this block is executed and again a jump is made to the block of the then next program instruction and so on.

SUMMARY OF THE INVENTION

It is an object of the invention to provide a method of the kind set forth which is comparatively faster than the known method. This object is achieved according to the invention in a method which is characterized in that during the execution of the interpreter on an instruction-level parallel processor machine instructions implementing a first one of the preparatory steps are executed in parallel with machine instructions implementing a second one of the preparatory steps for respective ones of the series of program instructions. Executing the machine instructions for two of the preparatory steps in parallel, each step being executed for its own program instruction, means that at least two different program instructions are being executed simultaneously. This significantly improves the speed of program execution, because it is no longer necessary to execute all required machine instructions in a single and hence longer sequence.

Parallel processing of instructions is known per se. It is described, for example, in the article 'Instruction-Level Parallel Processing: History, Overview, and Perspective', B. Ramakrishna Rau and Joseph A. Fisher, The Journal of Supercomputing, 7, pages 9-50, May 1993. In particular page 19 of that article describes instruction-level parallel processing on a VLIW (Very Long Instruction Word) processor. Such a processor has a number of slots and an
instruction may be placed in each slot. The instructions together form the so-called very long instruction word, which is executed by the processor as a single (very long) instruction. This results in the parallel processing of the individual instructions placed in the respective slots. It is the task of the compiler to identify which of the instructions are independent from each other and may be carried out in parallel. These instructions are thus candidates to be placed together in respective slots. An important aspect of this task of the compiler is the identification of loops in the execution of the program instructions and to move program instructions inside the loop. The purpose is to identify which of the instructions is independent from the others and is, therefore, a candidate to be executed in parallel with the others.

The textbook 'Compilers: Principles, Techniques and Tools', Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman, Addison-Wesley Series in Computer Science, Addison-Wesley Publishing Company, Reading, Mass., 1985, describes on pages 602 to 608 how loops in a program code are to be treated for program code optimization by the compiler. To enable optimization by the compiler, there should be no jump into the middle of a loop from the outside. The only entry into a loop is then via its header. According to the textbook, the control flow edges of a loop can be partitioned into back edges and forward edges. A back edge has the property of pointing to an entry block of the loop and the forward edges are the remaining edges. A loop can be optimized if its forward edges form an acyclic graph, i.e. a graph with no further loops. The structure of a threaded interpreter can thus be seen as a control flow graph comprising a complex arrangement of loops. Through each block, a loop may pass and after that block the loop may continue at each of the blocks, after which it may continue again at each of the blocks and so on. All control flow edges are forward edges and do not form an acyclic graph. Therefore, this control flow graph of the interpreter cannot be optimised by the known software pipeline algorithms disclosed in the textbook. Despite this teaching, the inventors have found that some of the preparatory steps of a threaded interpreter can be executed in parallel as described above.

An embodiment of the method according to the invention is defined in claim 1. In this embodiment, the machine instructions implementing the steps for interpreting the series of program instructions are executed in a three-stage pipeline. This means that three program instructions are interpreted in parallel; this significantly reduces the time needed to interpret and execute the program.

An embodiment of the method according to the invention is defined in claim 1. A byte code format is very suitable for describing and storing the program to be interpreted. The byte code format allows for easy retrieval and analysis of the program instruction, resulting in a simpler interpreter.

It is a further object of the invention to provide a system for executing an interpreter of the kind set forth which allows faster execution than the known system. This object is achieved according to the invention by a system for executing a program that is characterized in that the threaded interpreter is arranged to have machine instructions implementing a first one of the preparatory steps executed on an instruction-level parallel processor in parallel with machine instructions implementing a second one of the preparatory steps for respective ones of the series of program instructions. Since the machine instructions implementing two steps in the interpretation of the series of program instructions are carried out in parallel on the instruction-level parallel processor, the execution of the interpreter is faster.

The data carrier comprising the threaded interpreter according to the invention is characterized in that the threaded interpreter is arranged to have machine instructions implementing a first one of the preparatory steps executed on an instruction-level parallel processor in parallel with machine instructions implementing a second one of the preparatory steps for respective ones of the series of program instructions.

It is an object of the invention to provide a system for generating an interpreter of the kind set forth, which interpreter is suitable for faster execution of a program than the known interpreter. This object is achieved according to the invention by a system for generating an interpreter that is characterized in that the compiler is arranged to generate, for a particular program instruction by means of code duplication in the executable interpreter, a block comprising a translation into machine instructions of the execution unit for this particular program instruction, followed by a translation into machine instructions of the preparatory unit for a successor program instruction immediately succeeding the particular program instruction so as to obtain the executable interpreter in a threaded form. The system generates the executable threaded interpreter from a source code that does not contain this threaded structure. This allows the source code to be written in a [...] programming language.

A version of the method according to the invention is defined in claim 3. Since the generated interpreter is arranged to carry out the machine instructions implementing two of the preparatory steps in parallel on an instruction-level parallel processor, two different program instructions are executed simultaneously during the execution of a program by this interpreter. This significantly reduces the time needed to execute the interpreter interpreting the program.

The data carrier comprising the compiler according to the invention is characterized in that the compiler is arranged to generate, for a particular program instruction by means of code duplication in the executable interpreter, a block comprising a translation into machine instructions of the execution unit for this particular program instruction, followed by a translation into machine instructions of the preparatory unit for a successor program instruction immediately succeeding the particular program instruction so as to obtain the executable interpreter in a threaded form.

Further advantageous embodiments of the invention are recited in the dependent Claims.

BRIEF DESCRIPTION OF THE INVENTION

The invention and its attendant advantages will be further elucidated with the aid of exemplary embodiments and the accompanying schematic drawings; therein:

FIG. 1 shows the control flow graph of a threaded interpreter,

FIG. 2 schematically shows a part of a Very Long Instruction Word processor,

FIG. 3 schematically shows the layout of part of a program to be executed by a VLIW processor according to the known approach,

FIG. 4 shows the execution of the interpreter steps for a number of program instructions according to the invention,

FIG. 5 schematically shows the layout of part of the program implementing the stages shown in FIG. 4,

FIG. 6 shows the control flow graph of the interpreter translated from the implementation in C,
FIG. 7 shows the control flow graph of the interpreter after a first optimization,

FIG. 8 shows the control flow graph of the interpreter after a further optimization,

FIG. 9 shows an embodiment of the system for executing a program according to the invention, and

FIG. 10 shows an embodiment of the system for generating the interpreter according to the invention.

Corresponding features in the various Figures are denoted by the same references.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

FIG. 1 shows the control flow graph of a threaded interpreter. The threaded interpreter has a number of blocks, of which the blocks 102, 104, 106 and 108 are shown in the graph. A block corresponds to a particular type of program instruction that can be interpreted by the interpreter and comprises a number of machine instructions to be executed by the processor. The threaded interpreter has a block for every type of program instruction that is supported. When a given program instruction is to be interpreted, control is passed to the block corresponding to the given type of program instruction, e.g. to block 106, and that block is executed. At the end of the execution of that block, it is determined which program instruction is to be carried out next and control is passed to the block corresponding to the type of that next program instruction, e.g. to block 102. It is a characteristic of the threaded interpreter that at the end of any block control may be passed to any of the other blocks.

The contents of a block of the interpreter depend on the nature of the storing of the program instructions, i.e. the format in which they are stored, and on the processor on which the interpreter runs. However, in a block of the threaded interpreter the following elements can be distinguished:

    one or more machine instructions emulating the program instruction, i.e. machine instructions that realize the purpose of the program instruction;

    a fetch instruction for fetching the next program instruction from the memory;

    a decode instruction for decoding the fetched program instruction so that the type of program instruction is determined;

    a jump instruction to the block corresponding to the type of program instruction.

In the preferred embodiment of the invention, the program instructions are stored in a so-called byte code format. According to such a format a program instruction is uniquely coded into a code that fits in a single byte. This byte code indicates the type of operation and may be followed by one or more immediate arguments indicating the operands of the instruction. The fetch and decode instructions are implemented in such a way that they can handle program instructions stored in a byte code format. Application of the invention, however, is not restricted to programs stored in a byte code format. In the case of a different format, the implementation of the fetch and decode instruction must accommodate this different format. The table below shows an example of the block for the program instruction for multiplication. The example concerns a byte code format and is given in a pseudo assembly language.

TABLE I
Implementation of a block of the threaded interpreter

    MUL tos, nos -> tos       //machine instruction for multiplication
    LDB pc++ -> bc            //fetch next byte code and increment program counter
    LDW table(bc) -> block    //decode the fetched byte code
    JMP block                 //jump to next block

The left column contains the machine instructions in pseudo assembly language and the right column contains comments on the individual instructions. The first row is the machine instruction emulating the program instruction. This is the multiplication of the element at the top of the stack by the next element on the stack and the result is put on top of the stack. The second row is the fetch instruction for fetching the next program instruction. The byte that is indicated by the program counter pc is loaded from the memory and put in the variable bc and the program counter is incremented to prepare for the next program instruction. The third row is the decode instruction for decoding the fetched program instruction. For decoding, use is made of an array containing the addresses of the various blocks of the interpreter. The fetched byte, having a value of between 0 and 255, is used as an index for the array and the indexed word is loaded into the variable block. The fetch and decode instructions are very simple due to the fact that the program instructions are stored in the memory as byte codes. The fourth row is the jump instruction to the block corresponding to the next program instruction.

FIG. 2 schematically shows a part of a Very Long Instruction Word processor. In this example, the processor has five functional units, 202-210, which are capable of carrying out operations in parallel with respect to each other. The processor also has a number of registers which are symbolically grouped in a register file 212. For the execution of an operation, a functional unit can use the contents of two registers of the register file as input and store the result of the operation in one of the registers of the register file. The functional units of a VLIW processor may be uniform in that each of the functional units can carry out each of the supported operations. However, the functional units may also be non-uniform in that a certain functional unit can only carry out a class of the available operations while another functional unit can carry out only a different class. An example in this respect is the situation where one functional unit is arranged for memory-related operations and another functional unit is arranged for arithmetic operations.

A Very Long Instruction Word (VLIW) instruction, symbolized by block 214, has five issue slots in which an operation to be carried out by a functional unit can be placed. The position of an issue slot in the VLIW instruction determines which of the functional units is to carry out the operation placed in that issue slot. In the example shown, an operation placed in issue slot 216 will be carried out by functional unit 202, an operation in issue slot 218 by functional unit 204, an operation in issue slot 220 by functional unit 206, an operation in issue slot 222 by functional unit 208, and an operation in issue slot 224 by functional unit 210. An operation placed in an issue slot, like operation 226, has an operation code field 228 indicating the type of operation. Furthermore, the operation 226 has register fields 230 and 232 indicating the two input registers and a register field 234 indicating the output register. The VLIW processor operates in cycles, a complete VLIW instruction being processed in each cycle. This results in the parallel execution of the operations placed in the issue slots of the
VLIW instruction. For some of the operations, the result is not immediately available at the start of the next cycle. So a subsequent operation needing that result cannot be scheduled immediately after such an operation. Examples of such operations for the processor used in the preferred embodiment are the load word instruction and the load byte instruction, each taking three cycles, and the jump instruction, which takes four cycles.

FIG. 3 schematically shows the layout of part of a program to be executed by a VLIW processor according to the known approach. The program can be viewed as a matrix wherein a row represents a VLIW instruction comprising the operations to be issued simultaneously. A specific column of the matrix represents the operations that are to be carried out by the corresponding functional unit. The order of execution is from the top row 302 down in the order as given in the matrix, unless a jump instruction imposes that another instruction is to be executed. The effect of the jump instruction, i.e. the jump to the specified address, occurs after the latency of the jump instruction has lapsed. Below, the execution of a block of the threaded interpreter on a VLIW processor will be illustrated while using the four machine instructions given in Table I. In practice, some operation other than the ones originating from that table may be scheduled in a free issue slot but this is not shown for reasons of clarity and is of no significance for explaining the invention. The MUL operation and the LDB operation can be scheduled in the first VLIW instruction since these two operations do not depend on each other. The MUL operation is the realisation of the present program instruction, whereas the LDB operation is the fetching of the next program instruction. The LDW operation cannot yet be scheduled since it requires the result of the LDB operation and the JMP operation cannot yet be scheduled since it requires the result of the LDW operation. It takes three cycles before the result of the LDB operation becomes available and, therefore, the LDW operation is issued in the fourth VLIW instruction, indicated by row 304. It takes three cycles before the result of the LDW operation becomes available. The JMP operation is, therefore, issued in the seventh VLIW instruction, indicated by row 306. Since it takes four cycles before the result of the JMP operation is effectuated, the execution of the whole block specified by Table I takes at least ten cycles of the VLIW processor. On average, the operations emulating the program instruction, like the MUL operation, require two cycles each. Furthermore, for almost all program instructions the emulating operations require less than ten cycles. Therefore, one can say that the execution of a single block of the threaded interpreter requires ten cycles of the VLIW processor. The fact that the LDW operation can only be executed when the result of the LDB operation has become available and, therefore, depends on the LDB operation, is indicated by arrow 308 from the LDB operation to the LDW operation. In the same way, arrow 310 indicates that the JMP operation depends on the LDW operation.

FIG. 4 shows the execution of the interpreter steps for a number of program instructions according to the invention. The execution of the steps in the interpreter for interpreting a program instruction is depicted by a row. Furthermore, this execution is partitioned in a number of stages, depicted by respective fields in the row. For the i-th program instruction, row 402 has a stage 404 for the fetch step, a stage 406 for the decode step and a stage 408 for the jump step and the execution step. It is to be noted that the execution step is intended for executing the machine instructions emulating program instruction i, whereas the fetch step, the decode step and the jump step are operating on program instruction i+1. This is in conformity with the nature of the threaded interpreter as explained in relation to FIG. 1 and Table I. Table I shows a block for a particular program instruction, in which block the particular program instruction is executed and a jump to the next program instruction is prepared and made. So in stage 404 the program instruction i+1 is fetched from the memory and in stage 406 that program instruction is decoded. In stage 408 a jump is made to the unit of machine instructions that emulates the program instruction i+1. In stage 408 the machine instructions of program instruction i are also executed. Row 410 shows the steps for the (i+1)-th program instruction: the fetch step in stage 412, the decode step in stage 414 and in stage 416 the jump step and the execution step of the machine instructions emulating the program instruction i+1. Analogously, row 418 shows those steps for program instruction i+2 in stages 420, 422 and 424 respectively. In the figure, time is represented from left to right and the stages for a program instruction are executed from left to right, e.g. the fetch step of a program instruction is executed prior to its decode step and the decode step of a program instruction is executed prior to its jump step.

The stages that are shown above one another are carried out in parallel on a VLIW processor. So the jump and the execution step related to program instruction i of stage 408 are carried out simultaneously with the decode step related to program instruction i+1 of stage 414 and simultaneously with the fetch step relating to program instruction i+2 of stage 420. When this column of stages has been executed, the next iteration takes place, the stages 416, 422 and 426 then being executed in parallel. In these stages, the same steps as in the previous iteration are executed, but now for the successor program instructions. The rows and stages of FIG. 4 are also referred to as the pipeline of the interpreter or more particularly as the software pipeline. In the preferred embodiment, a three-stage pipeline is employed, meaning that three program instructions are being executed in parallel. The interpretation of the program instructions in a software pipeline as described above operates on the assumption that the next program instruction to be interpreted is the one that immediately succeeds the current one. In case of a jump program instruction, this assumption is not valid and the flow of interpreted program instructions will be different from the sequential order. In this case, the pipeline is initialized and operating the pipeline starts with the program instruction to which the jump has been made.

It is to be noted that the stages in FIG. 4 contain parts of program instructions that are part of the program that is being interpreted. The stages contain parts of machine instructions that are carried out by the processor. In other words, FIG. 4 shows a software pipeline during execution of the interpreter according to the invention and does not show a hardware pipeline for execution of machine instructions by a processor.

FIG. 5 schematically shows the layout of part of the program implementing the stages shown in FIG. 4. This part shows the three stages of respective program instructions that are executed in parallel, in this case being stage 408, stage 414 and stage 420. The JMP operation 502 and the MUL operation 504 are scheduled in the first VLIW instruction. These operations can be executed in parallel since they do not depend on each other. The MUL operation is the realisation of program instruction i and the JMP operation is the jump to program instruction i+1. The MUL operation and the JMP operation correspond to stage 408 of the i-th program instruction. The JMP operation 502 takes four cycles to complete and therefore in order to complete the
stage at least three cycles must follow the cycle in which the JMP operation is scheduled. The LDW operation 506 is scheduled in the second VLIW instruction and implements the decode belonging to program instruction i+1, as shown in stage 414. The LDW operation takes three cycles to complete and may, therefore, be scheduled in the first or second VLIW instruction without affecting the length of the program fragment since the JMP operation 502 takes four cycles anyway. The LDB operation 508 is scheduled in the first VLIW instruction and implements the fetch belonging to program instruction i+2, as shown in stage 420. The LDB operation takes three cycles to complete and may, therefore, be scheduled in the first or second VLIW instruction without affecting the length of the program fragment, since the JMP operation 502 takes four cycles.

To summarize the relation between the operations shown in FIG. 5 and the steps of the interpreter shown in FIG. 4 and to explicitly show on which program instructions the operations work:

    the MUL operation belongs to the execution of program instruction i and emulates program instruction i,

    the JMP operation belongs to the execution of program instruction i and jumps to the block of program instruction i+1,

    the LDW operation belongs to the execution of program instruction i+1 and decodes program instruction i+2, and

    the LDB operation belongs to the execution of program instruction i+2 and fetches program instruction i+3.

The execution of the next three stages is carried out in a next block of VLIW instructions, similar to the ones shown in FIG. 5 and operating on respective next program instructions. The JMP operation 502 is dependent on the LDW operation of a previous block of VLIW instructions and not on the LDW operation 506 of the present block. This dependence is illustrated by arrow 510, which is drawn with a loop outside the matrix to indicate dependence on a previous iteration, i.e. a previous block of VLIW instructions. Since the previous block has been completely finished prior to the start of the present block, the JMP operation may be scheduled immediately at the start of the present block. In the same way, the LDW operation 506 is dependent on the LDB operation of the previous block and not on the LDB [...]

[...] such a processor is called an instruction-level parallel processor. A VLIW processor belongs to a particular subclass of the class of instruction-level parallel processors.

The interpreter in the preferred embodiment has been written in the programming language C in order to make it a portable program usable on different processors. It is not possible to directly implement a threaded interpreter in ANSI C, since this language lacks variable labels. Therefore, at the end of a block of the threaded interpreter it is not possible to implement a jump instruction to a block that is to be determined at runtime.

Therefore, the interpreter has been implemented in ANSI C by a switch statement which is contained within an endless while loop and has been compiled by a compiler that has optimized and rearranged the compiled instructions in order to obtain a threaded interpreter.

The table below shows one block of the interpreter in ANSI C.

TABLE II
Block of the interpreter for the multiply program instruction

    while (1) {
        switch (b0) {
        ...                               //other cases
        case 0x4e:                        //multiply operation
            tos = tos * nos;              //emulation code
            nos = sp[2];                  //update stack cache
            sp += 1;                      //update stack pointer
            b0 = b1; b1 = b2; b2 = b3;    //shift pre-fetch pipeline
            b3 = b4; b4 = b5;
            b5 = b6; b6 = pc[7];          //pre-fetch bytes
            pc += 1;                      //update program counter
            #pragma TCS-graft_here        //to create threaded interpreter
            break;
        ...                               //other cases
        }
    }

The implementation as a pipelined interpreter has been realised by explicitly maintaining a set of pre-fetched bytes b0, ..., bn in the interpreter. The argument of the interpreter switch is b0, being the byte code of the program instruction to be interpreted. The immediate arguments of that program instruction are b1, ..., bm, where m is the number of immediate arguments the program instruction requires. In

operation 508 of the present block. This dependence is the rare case that a program instruction requires more than 

indicated by arrow 512. 45 n byte immediate arguments, the missing m-n bytes are 

The operations in a single block are not dependent on each fetched from memory. Determining the value for n involves 

other and are scheduled in such a way that the whole block a trade-off between the amount of instructions required to 

requires as few cycles of the VLIW processor as possible. In shift the pre-fetched bytes and the chance that insufficient 

the example, the JMP operation requires four cycles to pre-fetching slows down the pipeline. It has been found 

complete. The other operations, in particular the MUL 50 empirically that six is a suitable value for n. After bO, . . . , 

operation, are finished earlier or at the latest at the same bm have been used, the pre-fetch pipeline is shifted by m+1 

instant and therefore the whole block takes four cycles. This positions and m+1 new bytes are fetched from the memory 

will be the same for other types of program instructions as Once bytes are pre-fetched sufficiently ahead, the compiler 

well, as long as the operation or operations emulating the can move decode load operations to preceding iterations as 

program instruction, such as the MUL operation in the 55 described below. The pragma "TCS-graft_here" is an 

example, require four cycles at the most. In practice, this is instruction to the compiler that such optimization is to be 

true for most types of program instruction. This means that carried out there. 

the scheduling of steps and operations as shown in the FIGS, FIG. 6 shows the control flow graph of the interpreter 

4 and 5 has reduced the interpretation of a program instruc- translated from the implementation in C. A control flow 

tion on a VLIW processor from ten cycles, as shown in FIG. 60 graph shows the structure and the possible flows of the 

3, to four cycles. program translated by the compiler. The control flow graph 

The preferred embodiment of the invention concerns the contains basic blocks and control flow edges. A basic block 

execution of the threaded interpreter on the Philips VLIW contains a number of instructions, each of which is executed, 

processor TM 1000. However, the invention can also be in the order given, when control is passed to that basic block, 

carried out on another type of processor allowing machine 65 A control flow edge indicates how control can be passed 

instructions to be executed in parallel. This technique is from one basic block to the other. Basic block 602 is the 

generally called instruction-level parallel processing and range check to verify whether the switch argument bO 
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corresponds to one of the cases of the switch statement. If 
this is not so, control passes to basic block 604 for handling 
this exception. If bO has a corresponding case, control passes 
to basic block 605 containing the switch and after that to the 
basic block of the relevant case, e.g. to basic block 608. 
After each of the basic blocks 604 to 610, control passes to 
basic block 612, which is a jump back to basic block 602. 
This jump reflects the endless while loop as given in the C 
program. 
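
As an illustration only, the switch-in-a-loop structure and the pre-fetch pipeline of Table II can be sketched as a complete C function. This is a minimal sketch under stated assumptions, not the interpreter of the preferred embodiment: the byte codes OP_PUSH and OP_HALT, the helper fetch(), the function name run_switch() and the array b[] (used here instead of the separate variables b0 to b6) are all hypothetical. Only the multiply byte code 0x4e and the pipeline depth n = 6 are taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte codes; only 0x4e (multiply) is taken from Table II. */
enum { OP_HALT = 0x00, OP_PUSH = 0x10, OP_MUL = 0x4e };

#define N_PREFETCH 6 /* n = 6, the value the text reports as suitable */

/* Assumed helper: reading past the end of the program yields OP_HALT. */
static uint8_t fetch(const uint8_t *code, size_t len, size_t i)
{
    return i < len ? code[i] : OP_HALT;
}

/* Interprets `code` and returns the top of the stack when OP_HALT is met. */
int32_t run_switch(const uint8_t *code, size_t len)
{
    int32_t stack[64];
    int sp = 0;                /* number of values on the stack */
    size_t pc = 0;             /* index of the byte held in b[0] */
    uint8_t b[N_PREFETCH + 1]; /* pre-fetched bytes b0..b6 */
    int i;

    for (i = 0; i <= N_PREFETCH; i++)
        b[i] = fetch(code, len, (size_t)i);

    while (1) {
        int used; /* m + 1: byte code plus immediate arguments consumed */

        switch (b[0]) {
        case OP_PUSH: /* one immediate argument, taken from b1 */
            assert(sp < 64);
            stack[sp++] = (int8_t)b[1];
            used = 2;
            break;
        case OP_MUL: /* emulation code for the multiply instruction */
            assert(sp >= 2);
            stack[sp - 2] *= stack[sp - 1];
            sp--;
            used = 1;
            break;
        case OP_HALT:
        default: /* unknown byte codes also stop this sketch */
            return sp > 0 ? stack[sp - 1] : 0;
        }

        /* Shift the pre-fetch pipeline by m + 1 positions ... */
        for (i = 0; i + used <= N_PREFETCH; i++)
            b[i] = b[i + used];
        /* ... and fetch m + 1 new bytes from memory. */
        for (i = N_PREFETCH + 1 - used; i <= N_PREFETCH; i++)
            b[i] = fetch(code, len, pc + (size_t)used + (size_t)i);
        pc += (size_t)used;
    }
}
```

Calling run_switch() on the six bytes 0x10, 6, 0x10, 7, 0x4e, 0x00 pushes 6 and 7, multiplies them and returns 42. The sketch uses an array and loops for brevity; Table II uses individual variables and explicit assignments for the same shift.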

FIG. 7 shows the control flow graph of the interpreter 
after a first optimization. This first optimization by the 
compiler can be applied if the switch contains nearly 256 
cases, so if the number of types of program instruction 
supported by the interpreter is nearly 256. The number of 
cases is then increased to 256 by way of a number of dummy 
cases. This means that for any value of the byte code 
corresponding to the program instruction, a valid basic block 
is available in the switch and that the range check can be 
dispensed with. The control flow graph then directly starts 
with the switch in basic block 702. After that, control passes 
to one of the basic blocks 704 to 710 depending on the value 
of the byte code. As before, control then always passes to the 
jump in basic block 712. 
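
The effect of padding the switch to 256 cases can be illustrated with a table-based analogue. This is only a sketch of the idea, not the compiler optimization itself: the handler functions, the table dispatch[] and the functions init_dispatch() and decode_and_run() are hypothetical names. Because every one of the 256 possible byte values has an entry, indexing with an 8-bit byte code needs no range check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef int (*handler_fn)(void);

/* Hypothetical handlers; the return values only identify which one ran. */
static int op_nop(void)     { return 0; }
static int op_mul(void)     { return 1; }
static int op_illegal(void) { return -1; } /* dummy case */

static handler_fn dispatch[256];

/* Fill all 256 entries: real cases first overwritten on top of dummies. */
void init_dispatch(void)
{
    size_t i;
    for (i = 0; i < 256; i++)
        dispatch[i] = op_illegal;
    dispatch[0x00] = op_nop;
    dispatch[0x4e] = op_mul;
}

/* A uint8_t is always in 0..255, so no range check precedes the lookup. */
int decode_and_run(uint8_t byte_code)
{
    assert(dispatch[byte_code] != NULL); /* init_dispatch() must run first */
    return dispatch[byte_code]();
}
```

After init_dispatch(), decode_and_run(0x4e) reaches the multiply handler, while an unassigned value such as 0xff reaches the dummy handler instead of falling outside the table.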

FIG. 8 shows the control flow graph of the interpreter 
after a further optimization. The switch of basic block 702 
is moved backwards in the loop and, together with the jump 
of basic block 712, added to each of the basic blocks 704 to 
710. The switch plus the jump represent the decode of a 
program instruction and the jump to the block of that 
program instruction. This optimization results in a control
flow graph with 256 basic blocks, of which basic blocks 802
to 808 are shown, and control flow edges from each of the
blocks to every block. A basic block contains the
following elements:

execution of the machine instructions emulating the pro- 
gram instruction; 

a fetch of the byte code of the next program instruction; 

a decode of that byte code; 

a jump to the block corresponding to the decoded program 
instruction. 

The foregoing corresponds to the implementation of a 
threaded interpreter as described in relation to FIG. 1. The 
compiler according to the invention has thus formed a 
threaded interpreter from a program source code in ANSI C 
that did not contain the threaded structure. 
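
For comparison, the threaded structure that the compiler derives can be written by hand when variable labels are available. The sketch below relies on the "labels as values" extension of GNU C (supported by GCC and Clang), which is not part of ANSI C, so it is an assumption-laden illustration rather than the method of the invention. The byte codes OP_PUSH and OP_HALT, the macro NEXT() and the function run_threaded() are hypothetical. Each block ends with its own fetch, decode and jump, matching the four elements listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte codes; only 0x4e (multiply) is taken from Table II. */
enum { OP_HALT = 0x00, OP_PUSH = 0x10, OP_MUL = 0x4e };

/* Requires the GNU C "labels as values" extension (&&label, goto *p). */
int32_t run_threaded(const uint8_t *code, size_t len)
{
    void *block[256]; /* address of the block for every byte code */
    int32_t stack[64];
    int sp = 0;       /* number of values on the stack */
    size_t pc = 0;    /* index of the current byte code */
    int i;

    for (i = 0; i < 256; i++)
        block[i] = &&do_halt; /* dummy cases simply end the run */
    block[OP_PUSH] = &&do_push;
    block[OP_MUL]  = &&do_mul;

/* Fetch, decode and jump, repeated at the end of every block. */
#define NEXT() goto *block[pc < len ? code[pc] : OP_HALT]

    NEXT();

do_push: /* one immediate argument follows the byte code */
    assert(sp < 64 && pc + 1 < len);
    stack[sp++] = (int8_t)code[pc + 1];
    pc += 2;
    NEXT();

do_mul: /* emulation of the multiply instruction */
    assert(sp >= 2);
    stack[sp - 2] *= stack[sp - 1];
    sp--;
    pc += 1;
    NEXT();

do_halt:
    return sp > 0 ? stack[sp - 1] : 0;
#undef NEXT
}
```

There is no central loop and no shared switch: control passes directly from the end of one block to the start of the next, which is the property the control flow graph of FIG. 8 expresses.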

For interpretation of a program comprising a series of 
program instructions, the basic blocks are executed repeat- 
edly in iterations until the program terminates. In order to 
realize the pipelining as described with reference to FIG. 4, 
the compiler moves instructions from one iteration to 
another in a manner described below. The decode instruction 
is moved by the compiler to the preceding iteration. This 
means that the decode instruction of a block of a given 
program instruction relates to the program instruction which 
is located one position later in the series of program 
instructions, since the decode instruction has been moved 
thereto from the next block. The fetch instruction is moved 
back two iterations due to the pre-fetch pipeline specified in 
the C program above. This means that the fetch instruction 
of a block of a given program instruction relates to the 
program instruction which is located two positions later in 
the series of program instructions, since the fetch instruction 
has been moved to this block from two blocks later. Moving 
these instructions to the other iterations makes them
independent from the instructions inside a given block. This
allows the parallel processing of the instructions as
described in relation to FIG. 5.

The compiler moves an instruction of a block back to the
previous iteration by duplicating that instruction to all
possible predecessors of the basic block at hand. In case of
the threaded interpreter, this means duplicating the
instruction from a particular block to all other blocks since
the given block may be arrived at from each of the other blocks.
Since every block is modified in this way, i.e. every block
plays the role of the particular block once, each block
receives multiple copies of the instruction to be moved. The
instruction to be moved is the decode instruction which
produces the address of the block to be jumped to next. The
decode instruction receives as input a byte which is the byte
code of the next program instruction. This may be the next
byte in the pre-fetch pipeline or a later byte if one or more
of the bytes in the pre-fetch pipeline is an immediate
argument. So the exact implementation of the decode
instruction of a block depends on the type of program
instruction of that block, since different types may have a
different number of immediate arguments; therefore, a number
of different versions of the decode instruction exist
among the multiple copies. The compiler removes the duplicate
copies from the decode instructions that are moved to
a block and only the different versions remain to be executed
in that block.
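
One way to picture the removal of duplicates is to note that the decode instruction of a block reads the byte at position m+1 of the pre-fetch pipeline, where m is the number of immediate arguments of that block's program instruction. Blocks with the same m therefore contribute identical copies. The following sketch counts the distinct versions under that reading; the function distinct_decode_versions() and the sample data are hypothetical and do not describe the compiler's actual data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_PREFETCH 6 /* pipeline holds b0..b6 */

/* imm_count[t] is the number of immediate bytes of instruction type t.
 * Each distinct pipeline position m + 1 read by a decode is written to
 * `out` once, in order of first appearance; the count is returned.
 * `out` must have room for N_PREFETCH entries. */
size_t distinct_decode_versions(const uint8_t *imm_count, size_t n_types,
                                uint8_t *out)
{
    int seen[N_PREFETCH] = {0};
    size_t n_out = 0;
    size_t t;

    for (t = 0; t < n_types; t++) {
        uint8_t m = imm_count[t];
        assert(m < N_PREFETCH); /* next byte code must lie within b1..b6 */
        if (!seen[m]) {         /* a copy with this m is a new version */
            seen[m] = 1;
            out[n_out++] = (uint8_t)(m + 1);
        }
    }
    return n_out;
}
```

For five instruction types with 0, 1, 0, 2 and 1 immediate bytes, five copies of the decode instruction arrive in a block but only three distinct versions remain, reading positions 1, 2 and 3 of the pipeline.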

Realizing the desired pipeline for executing the threaded
interpreter is thus based on two features. The first feature is
the pre-fetch pipeline of 6 bytes as coded in the C program,
which allows moving the fetch step of a program instruction
two iterations backwards. The actual moving of the fetch
step is carried out by the compiler; this is a straightforward
task given the pre-fetch pipeline. The second feature is
moving the decode instruction of a program instruction one
iteration backwards. The compiler carries out this move by
duplicating the relevant machine instructions from all blocks
to all other blocks and by removing the duplicate
instructions from a block.
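
The resulting pipeline can be sketched in C as a loop in which each iteration executes program instruction i, decodes program instruction i+1 and fetches program instruction i+2. This is a simplified illustration under stated assumptions: every instruction is one byte long, and the byte codes, the type kind_t and the functions fetch(), decode() and run_pipelined() are hypothetical. Within one iteration the three steps only read values produced in earlier iterations, which is what allows an instruction-level parallel processor to issue them together.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-byte instructions acting on a single accumulator. */
typedef enum { K_HALT, K_INC, K_DBL } kind_t;

static uint8_t fetch(const uint8_t *code, size_t len, size_t i)
{
    return i < len ? code[i] : 0; /* past the end reads as HALT */
}

static kind_t decode(uint8_t byte)
{
    switch (byte) {
    case 0x01: return K_INC;
    case 0x02: return K_DBL;
    default:   return K_HALT;
    }
}

int32_t run_pipelined(const uint8_t *code, size_t len)
{
    int32_t acc = 0;
    size_t i = 0;
    kind_t decoded;  /* instruction i, decoded one iteration earlier */
    uint8_t fetched; /* instruction i+1, fetched one iteration earlier */

    assert(code != NULL);
    /* Prologue: fill the pipeline before the first iteration. */
    decoded = decode(fetch(code, len, 0));
    fetched = fetch(code, len, 1);

    for (;;) {
        /* These three steps do not depend on each other. */
        kind_t  exec_now     = decoded;                 /* execute i  */
        kind_t  decoded_next = decode(fetched);         /* decode i+1 */
        uint8_t fetched_next = fetch(code, len, i + 2); /* fetch i+2  */

        if (exec_now == K_HALT)
            return acc;
        if (exec_now == K_INC)
            acc += 1;
        else
            acc *= 2;

        decoded = decoded_next;
        fetched = fetched_next;
        i++;
    }
}
```

Running the bytes 0x01, 0x01, 0x02, 0x01, 0x00 increments twice, doubles and increments again, giving 5. In the interpreter of the text, the decoded value is the address of the next block rather than an enumeration, and the fetched bytes live in the pre-fetch pipeline.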

FIG. 9 shows an embodiment of the system for executing
a program according to the invention. The system 900 is
implemented according to a known architecture. The system
may be a workstation, a consumer apparatus like a television
set, or any other type of apparatus having the required
resources. The system has a VLIW processor 902 for carrying
out the machine instructions of a program module
loaded in memory 904. This memory may be random access
memory or a combination of random access memory and
read only memory. The system further has an interface 906
for communication with peripheral devices. There is a bus
908 for the exchange of commands and data between the
various components of the system. The peripheral devices of
the system include a storage medium 910 containing the
program to be interpreted. Alternatively, this program may
be stored in the read only memory of the system. The storage
medium 910 may be a hard disk or other suitable medium,
like an optical disc, a chip card or a tape. The peripheral
devices of the system further include a display 912 and an
input device 914 for communicating with a user of the
system. The system has a threaded interpreter 916 as
described above. The interpreter has a preparatory unit 918
that is arranged to retrieve a program instruction from the
memory 904 and to make it available for processing by the
interpreter. Furthermore, the interpreter has a unit 920
comprising the machine instruction or instructions that
emulate the retrieved program instruction. The program to be
executed comprises a series of program instructions and is
loaded into memory 904 for interpretation and execution by
interpreter 916.

FIG. 10 shows an embodiment of the system for generating
the interpreter according to the invention. The system
1000 is implemented according to a known architecture. The
system is a workstation based on a general-purpose
computer, but another type of computer may also be used.
The system has a processor 1002 for carrying out the
machine instructions of a program module loaded in
memory 1004. The system also includes an interface 1006
for communication with peripheral devices. There is a bus
1008 for the exchange of commands and data between the
various components of the system. The peripheral devices of
the system include a storage medium 1010 containing the
source of the interpreter to be compiled. The resultant
executable compiler is also stored on the storage medium
1010. The storage medium 1010 may be a hard disk or other
suitable medium, like an optical disc, a chip card or a tape.
The peripheral devices of the system also include a display
1012 and an input device 1014 for communicating with a
user of the system. The system includes a compiler 1016 as
described above.

What is claimed is:

1. A system for generating an executable interpreter for 
interpreting a program comprising a series of program 
instructions, the system comprising a compiler for translat- 
ing the interpreter from a source code into machine 
instructions, the interpreter in the source code comprising: 
a preparatory unit for executing at least one preparatory 
step making one of the program instructions available 
in the interpreter, and 
an execution unit with emulation code for emulating one 
of the program instructions, characterized in that the 
compiler is arranged to generate, for a particular pro- 
gram instruction by means of code duplication in the 
executable interpreter, a block comprising 
a translation into machine instructions of the execution 
unit for this particular program instruction, followed 
by 

a translation into machine instructions of the prepara- 
tory unit for a successor program instruction imme- 
diately succeeding the particular program instruction 
so as to obtain the executable interpreter in a 
threaded form, 
wherein the compiler is arranged: 

to generate the threaded interpreter arranged to be 
executed on an instruction-level parallel processor in 
repeated iterations, and 

to generate the threaded interpreter arranged to have 
machine instructions implementing a first one of the
preparatory steps executed in parallel with machine
instructions implementing a second one of the
preparatory steps for respective ones of the series of
program instructions by moving the machine
instructions implementing the first one of the preparatory
steps to an immediately preceding iteration.

2. A system as claimed in claim 1, wherein the compiler
is arranged to move the machine instructions implementing
the first one of the preparatory steps to an immediately
preceding iteration for each of the blocks and wherein the
compiler is arranged to remove duplicate copies of machine
instructions in a particular block resulting from such
moving.

3. A compiler for generating an executable interpreter for
interpreting a program comprising a series of program
instructions implemented in a processor, the compiler being
arranged to translate the interpreter from a source code into
executable machine instructions, the interpreter in the source
code comprising:

a preparatory unit for executing at least one preparatory
step making one of the program instructions available
in the interpreter, and
an execution unit with an emulation code for emulating
one of the program instructions,
characterized in that the compiler is arranged to generate, for
a particular program instruction by means of code
duplication in the executable interpreter, a block comprising
a translation into machine instructions of the execution
unit for this particular program instruction, followed by
a translation into machine instructions of the preparatory
unit for a successor program instruction immediately
succeeding the particular program instruction, so as to
obtain the executable interpreter in a threaded form,
wherein the compiler is arranged:

to generate the threaded interpreter arranged to be
executed on an instruction-level parallel processor in
repeated iterations, and
to generate the threaded interpreter arranged to have
machine instructions implementing a first one of the
preparatory steps executed in parallel with machine
instructions implementing a second one of the
preparatory steps for respective ones of the series of program
instructions by moving the machine instructions
implementing the first one of the preparatory steps to an
immediately preceding iteration.

* * * * *